Learning Clusterwise Similarity with First-Order Features

Author

  • Aron Culotta
Abstract

Many clustering problems can be reduced to the task of partitioning a weighted graph into highly-connected components. The weighted edges indicate pairwise similarity between two nodes and can often be estimated from training data. However, in many domains, there exist higher-order dependencies not captured by pairwise metrics. For example, there may exist soft constraints on aggregate features of an entire cluster, such as its size, mean or mode. We propose clusterwise similarity metrics to directly measure the cohesion of an entire cluster of points. We describe ways to learn a clusterwise metric from labeled data, using weighted, first-order features over clusters. Extending recent work equating graph partitioning with inference in graphical models, we frame this approach within a discriminatively-trained Markov network. The advantages of our approach are demonstrated on the task of coreference resolution.

1 Clusterwise Similarity

The input to a clustering algorithm is often a weighted undirected graph, where the weight between two nodes indicates their similarity. The goal of clustering is to partition the graph into components with heavy intra-cluster edges and light inter-cluster edges. Recently, these weights have been estimated from training data using maximum likelihood [1, 2]. While effective, this approach factors the similarity metric into a set of pairwise functions and thereby sacrifices expressivity for tractability. For many domains, there exist clusterwise soft constraints that cannot be represented by pairwise functions.

Consider the task of coreference resolution (also called identity uncertainty or deduplication), in which mentions referring to the same underlying object are clustered together. For example, given a database of research paper citations, we would like to cluster together citations referring to the same paper. Examples of clusterwise constraints include (1) a paper is rarely referenced more than 100 times or (2) an author's name is unlikely to be misspelled 5 different ways.

To represent these clusterwise constraints, we propose using similarity metrics to measure the compatibility of a cluster of nodes, rather than simply pairs of nodes. The challenge lies in constructing efficient methods to estimate these metrics from training samples and to partition the resulting graph.

Given a deduplicated training database, we estimate the clusterwise metric by sampling positive and negative example clusters. We then specify a set of first-order predicates as features to describe the compatibility of the nodes in each cluster. Each of these predicates has an associated weight, which is estimated by maximizing the conditional log-likelihood of the training data. We approximate the optimal partitioning of a graph with an agglomerative algorithm that greedily merges clusters based on their predicted compatibility scores. We apply our technique to deduplicate authors and citations in a publications database and find that the clusterwise metric achieves higher F1 scores than the pairwise metric on 5 of 7 datasets. For more details, we refer the reader to our technical report [3].
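The pipeline described above (first-order features over a cluster, weights fit by maximizing conditional log-likelihood, then greedy agglomerative merging) can be illustrated with a minimal Python sketch. This is not the author's implementation: the mention strings, the `pair_match` predicate, the feature names, the plain logistic model trained by stochastic gradient ascent, and the 0.5 merge threshold are all assumptions chosen for illustration, and the training clusters are hand-written rather than sampled from a real deduplicated database.

```python
import math
from itertools import combinations

def pair_match(a, b):
    """Hypothetical pairwise predicate: same last name and same first initial."""
    ta, tb = a.lower().split(), b.lower().split()
    return ta[-1] == tb[-1] and ta[0][0] == tb[0][0]

def cluster_features(cluster):
    """Quantified (first-order) features computed over a whole cluster."""
    matches = [pair_match(a, b) for a, b in combinations(cluster, 2)]
    spellings = {m.lower() for m in cluster}
    return {
        "bias": 1.0,
        "forall_pairs_match": float(all(matches)),        # universal quantifier
        "exists_pair_mismatch": float(not all(matches)),  # existential quantifier
        "five_or_more_spellings": float(len(spellings) >= 5),  # aggregate constraint
    }

def score(cluster, weights):
    """Compatibility of a cluster: logistic function of the weighted features."""
    z = sum(weights.get(k, 0.0) * v for k, v in cluster_features(cluster).items())
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=200, lr=0.5):
    """Stochastic gradient ascent on the conditional log-likelihood (assumed model)."""
    weights = {}
    for _ in range(epochs):
        for cluster, label in examples:
            error = label - score(cluster, weights)
            for k, v in cluster_features(cluster).items():
                weights[k] = weights.get(k, 0.0) + lr * error * v
    return weights

def greedy_agglomerate(mentions, weights, threshold=0.5):
    """Repeatedly merge the pair of clusters whose union scores highest."""
    clusters = [[m] for m in mentions]
    while len(clusters) > 1:
        best_score, best_pair = threshold, None
        for i, j in combinations(range(len(clusters)), 2):
            s = score(clusters[i] + clusters[j], weights)
            if s > best_score:
                best_score, best_pair = s, (i, j)
        if best_pair is None:  # no merge looks compatible enough; stop
            break
        i, j = best_pair
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

# Hand-written toy examples: (cluster, 1 = coreferent, 0 = not coreferent).
examples = [
    (["J. Smith", "John Smith"], 1),
    (["J. Doe", "Jane Doe", "J Doe"], 1),
    (["J. Smith", "J. Doe"], 0),
    (["John Smith", "Jane Doe", "J. Doe"], 0),
    # Every pair matches, yet five spellings of one name is labeled implausible.
    (["J. Smith", "John Smith", "J Smith", "Jon Smith", "Jhon Smith"], 0),
]
weights = train(examples)
clusters = greedy_agglomerate(["J. Smith", "John Smith", "J. Doe", "Jane Doe"], weights)
```

The last training example is the interesting one: all of its pairs satisfy the pairwise predicate, so only the cluster-level `five_or_more_spellings` feature can separate it from the positive examples. That is the kind of constraint a purely pairwise metric cannot express.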


Similar resources

Comparison of Different Machine Learning Methods for Extractive Persian Speech-to-Speech Summarization Without Using Transcripts

In this paper, extractive speech summarization using different machine learning algorithms was investigated. The task of speech summarization deals with extracting important and salient segments from speech in order to access, search, extract and browse speech files more easily and at lower cost. In this paper, a new method for speech summarization without using automatic speech recognitio...


A New Similarity Measure Based on Item Proximity and Closeness for Collaborative Filtering Recommendation

Recommender systems utilize information retrieval and machine learning techniques for filtering information and can predict whether a user would like an unseen item. User similarity measurement plays an important role in collaborative filtering based recommender systems. In order to improve accuracy of traditional user based collaborative filtering techniques under new user cold-start problem a...


Efficient calculation of sentence semantic similarity: a proposed scheme based on machine learning approaches and NLP techniques

Aim of Study: Sentence semantic similarity plays a crucial role in a variety of applications such as Machine Translation, Information Retrieval, Question Answering and Multi-document Summarization. Considering the variability of natural language expression, sentence semantic similarity detection is not a trivial task. This paper tries to make use of Natural Language Processing (NLP) as well as m...


Fuzzy clusterwise linear regression analysis with symmetrical fuzzy output variable

The traditional regression analysis is usually applied to homogeneous observations. However, there are several real situations where the observations are not homogeneous. In these cases, by utilizing the traditional regression, we have a loss of performance in fitting terms. Then, for improving the goodness of fit, it is more suitable to apply the so-called clusterwise regression analysis. The ...


The Impact of Presentation Order on Category Learning Strategies: Behavioral Data and Self-Reports

The presentation order in supervised categorization learning can influence the category representation. For example, the order can bias a rule-based approach focusing the identification of relevant features or an exemplar-based approach focusing the similarity of category members. In a blocked design stimuli can either be presented in a way that relevant features over stimuli become obvious or ...



Publication date: 2005